Efficient Image Captioning for Edge Devices
Abstract
Recent years have witnessed the rapid progress of image captioning. However, the demands for large memory storage and heavy computational burden prevent these captioning models from being deployed on mobile devices. The main obstacles lie in the heavyweight visual feature extractors (i.e., object detectors) and complicated cross-modal fusion networks. To this end, we propose LightCap, a lightweight image captioner for resource-limited devices. The core design is built on the recent CLIP model for efficient image captioning. To be specific, on the one hand, we leverage the CLIP model to extract compact grid features without relying on time-consuming object detectors. On the other hand, we transfer the image-text retrieval design of CLIP to image captioning scenarios by devising a novel visual concept extractor and a cross-modal modulator. We further optimize the cross-modal fusion model and parallel prediction heads via sequential and ensemble distillations. With the carefully designed architecture, our model merely contains 40M parameters, saving the model size by more than 75% and the FLOPs by more than 98% in comparison with the current state-of-the-art methods. In spite of the low capacity, our model still exhibits state-of-the-art performance on prevalent datasets, e.g., 136.6 CIDEr on the COCO Karpathy test split. Testing on a smartphone with only a single CPU, the proposed LightCap exhibits a fast inference speed of 188ms per image, which is ready for practical applications.
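The efficiency figures in the abstract can be turned into rough bounds with a few lines of arithmetic. The sketch below is only a back-of-the-envelope reading of the stated numbers, not anything taken from the paper: it assumes model size scales with parameter count, treats the "more than" savings as lower bounds, and assumes images are processed one at a time. The function and constant names are hypothetical.

```python
# Figures quoted in the abstract; everything derived from them is an estimate.
PARAMS_MILLIONS = 40.0   # stated LightCap parameter count
SIZE_SAVING = 0.75       # "more than 75%" model-size saving (lower bound)
FLOPS_SAVING = 0.98      # "more than 98%" FLOPs saving (lower bound)
LATENCY_MS = 188.0       # stated per-image latency on a single smartphone CPU


def implied_baseline(value: float, saving: float) -> float:
    """Return the baseline quantity that `value` is `saving` (a fraction) smaller than."""
    if not 0.0 <= saving < 1.0:
        raise ValueError("saving must be in [0, 1)")
    return value / (1.0 - saving)


def throughput_per_second(latency_ms: float) -> float:
    """Images per second, assuming strictly sequential processing."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return 1000.0 / latency_ms


# Assumption: model size is proportional to parameter count.
baseline_params = implied_baseline(PARAMS_MILLIONS, SIZE_SAVING)
# FLOPs of the compared method, expressed as a multiple of LightCap's FLOPs.
flops_ratio = implied_baseline(1.0, FLOPS_SAVING)
fps = throughput_per_second(LATENCY_MS)

print(f"Compared model has at least ~{baseline_params:.0f}M parameters")
print(f"Compared model needs at least ~{flops_ratio:.0f}x the FLOPs")
print(f"Sequential throughput is about {fps:.1f} images per second")
```

Read this way, the compared state-of-the-art models would have upwards of roughly 160M parameters and need about fifty times the FLOPs or more, while 188ms per image corresponds to a little over five captions per second on the phone.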
Similar resources
Contourlet-Based Edge Extraction for Image Registration
Image registration is a crucial step in most image processing tasks for which the final result is achieved from a combination of various resources. In general, the majority of registration methods consist of the following four steps: feature extraction, feature matching, transform modeling, and finally image resampling. As the accuracy of a registration process is highly dependent to the fe...
Contrastive Learning for Image Captioning
Image captioning, a popular topic in computer vision, has achieved substantial progress in recent years. However, the distinctiveness of natural descriptions is often overlooked in previous work. It is closely related to the quality of captions, as distinctive captions are more likely to describe images with their unique aspects. In this work, we propose a new learning method, Contrastive Learn...
Stack-Captioning: Coarse-to-Fine Learning for Image Captioning
The existing image captioning approaches typically train a one-stage sentence decoder, which is difficult to generate rich fine-grained descriptions. On the other hand, multi-stage image caption model is hard to train due to the vanishing gradient problem. In this paper, we propose a coarse-to-fine multistage prediction framework for image captioning, composed of multiple decoders each of which...
Phrase-based Image Captioning
Generating a novel textual description of an image is an interesting problem that connects computer vision and natural language processing. In this paper, we present a simple model that is able to generate descriptive sentences given a sample image. This model has a strong focus on the syntax of the descriptions. We train a purely bilinear model that learns a metric between an image representat...
Domain-Specific Image Captioning
We present a data-driven framework for image caption generation which incorporates visual and textual features with varying degrees of spatial structure. We propose the task of domain-specific image captioning, where many relevant visual details cannot be captured by off-the-shelf general-domain entity detectors. We extract previously-written descriptions from a database and adapt them to new q...
Journal
Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence
Year: 2023
ISSN: 2159-5399, 2374-3468
DOI: https://doi.org/10.1609/aaai.v37i2.25359